Learned Vector-Space Models for Document Retrieval
Authors
Abstract
Bellcore's Latent Semantic Indexing (LSI) system and HNC's MatchPlus system represent two attempts to model and exploit the inter-relationships among terms to improve information retrieval. Most information retrieval methods depend on exact matches between words in users' queries and words in documents. Typically, documents containing one or more query words are returned to the user. Such methods will, however, fail to retrieve relevant materials that do not share words with users' queries. One reason for this is that the standard retrieval models (e.g., Boolean, standard vector, probabilistic) treat words as if they are pairwise orthogonal or independent, although it is quite obvious that they are not. Consider, for example, the terms "automobile", "car", "driver", and "elephant". The terms "automobile" and "car" are synonyms, "driver" is a related concept, and "elephant" is pretty much unrelated. In most retrieval systems the query "automobile" is no more likely to retrieve an article about cars than one about elephants, if neither author uses precisely the term automobile. It would be preferable, however, if a query about automobiles would retrieve articles about cars and even articles about drivers to a lesser extent. A central theme of both the LSI and MatchPlus methods is that term-term inter-relationships like these can be explicitly modeled in the representation and automatically used to improve retrieval.
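The automobile/car/driver/elephant example can be made concrete with a small sketch. This is not the LSI or MatchPlus algorithm: both learn term relationships from data, whereas the similarity values in `TERM_SIM` below are hand-picked assumptions for illustration, and the function names are hypothetical. The sketch only contrasts scoring with pairwise-orthogonal terms against scoring that takes term-term relationships into account.

```python
# Vocabulary taken from the example in the abstract.
TERMS = ["automobile", "car", "driver", "elephant"]

# Hypothetical term-term similarities (illustrative values, NOT learned from data).
TERM_SIM = {
    ("automobile", "car"): 0.9,     # synonyms
    ("automobile", "driver"): 0.5,  # related concept
    ("car", "driver"): 0.5,         # related concept
}


def term_similarity(a, b):
    """Symmetric lookup; unlisted pairs are treated as unrelated (0.0)."""
    if a == b:
        return 1.0
    return TERM_SIM.get((a, b), TERM_SIM.get((b, a), 0.0))


def to_vector(text):
    """Term-count vector over TERMS (whitespace tokenization, lowercase)."""
    words = text.lower().split()
    return [words.count(t) for t in TERMS]


def exact_match_score(query, doc):
    """Inner product with terms treated as pairwise orthogonal."""
    q, d = to_vector(query), to_vector(doc)
    return sum(qi * di for qi, di in zip(q, d))


def related_term_score(query, doc):
    """Inner product weighted by term-term similarity: sum_ij q_i * S_ij * d_j."""
    q, d = to_vector(query), to_vector(doc)
    return sum(
        q[i] * term_similarity(TERMS[i], TERMS[j]) * d[j]
        for i in range(len(TERMS))
        for j in range(len(TERMS))
    )


if __name__ == "__main__":
    for doc in ["car", "driver", "elephant"]:
        print(doc, exact_match_score("automobile", doc),
              related_term_score("automobile", doc))
```

Under the orthogonal model the query "automobile" scores the "car" and "elephant" documents identically (both zero), while the similarity-weighted score ranks "car" above "driver" and "driver" above "elephant", which is the behavior the abstract argues for.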
Similar resources
Improved Skips for Faster Postings List Intersection
Information retrieval can be achieved through computerized processes by generating a list of relevant responses to a query. The document processor, matching function and query analyzer are the main components of an information retrieval system. Document retrieval system is fundamentally based on: Boolean, vector-space, probabilistic, and language models. In this paper, a new methodology for mat...
Probability Bracket Notation, Term Vector Space, Concept Fock Space and Induced Probabilistic IR Models
After a brief introduction to Probability Bracket Notation (PBN) for discrete random variables in time-independent probability spaces, we apply both PBN and Dirac notation to investigate probabilistic modeling for information retrieval (IR). We derive the expressions of relevance of document to query (RDQ) for various probabilistic models, induced by Term Vector Space (TVS) and by Concept Fock ...
An Artificial Intelligence Approach To Information Retrieval
Document structure weighting is a technique whereby different parts of a document (title, abstract, etc.) contribute unevenly to the overall document weight during ranking. Near optimal weights can be learned with a GA. Doing so shows a statistically significant 5% relative improvement in MAP for vector space inner product and Croft’s probabilistic ranking, but no improvement for BM25. Two appl...
Lecture 5: Introduction to (Robertson/Spärck Jones) Probabilistic Retrieval
In this lecture, we will introduce our second paradigm for document retrieval: probabilistic retrieval. We will focus on Robertson and Spärck Jones’ 1976 version, presented in the paper Relevance Weighting of Search Terms. This was an influential paper that was published when the Vector Space Model was first being developed — it is important to keep in mind the differences and similarities betw...
Journal:
- Inf. Process. Manage.
Volume 31, Issue
Pages -
Publication date: 1995